{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Predicting Survival on the Titanic\n",
    "\n",
    "### History\n",
    "Perhaps one of the most infamous shipwrecks in history, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 people on board. Interestingly, by analysing the probability of survival based on few attributes like gender, age, and social status, we can make very accurate predictions on which passengers would survive. Some groups of people were more likely to survive than others, such as women, children, and the upper-class. Therefore, we can learn about the society priorities and privileges at the time.\n",
    "\n",
    "### Assignment:\n",
    "\n",
    "Build a Machine Learning Pipeline, to engineer the features in the data set and predict who is more likely to Survive the catastrophe.\n",
    "\n",
    "Follow the Jupyter notebook below, and complete the missing bits of code, to achieve each one of the pipeline steps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# to handle datasets\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "# for visualization\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# to divide train and test set\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# feature scaling\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "# to build the models\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "# to evaluate the models\n",
    "from sklearn.metrics import accuracy_score, roc_auc_score\n",
    "\n",
    "# to persist the model and the scaler\n",
    "import joblib\n",
    "\n",
    "# to visualise al the columns in the dataframe\n",
    "pd.pandas.set_option('display.max_columns', None)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prepare the data set"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pclass</th>\n",
       "      <th>survived</th>\n",
       "      <th>name</th>\n",
       "      <th>sex</th>\n",
       "      <th>age</th>\n",
       "      <th>sibsp</th>\n",
       "      <th>parch</th>\n",
       "      <th>ticket</th>\n",
       "      <th>fare</th>\n",
       "      <th>cabin</th>\n",
       "      <th>embarked</th>\n",
       "      <th>boat</th>\n",
       "      <th>body</th>\n",
       "      <th>home.dest</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Allen, Miss. Elisabeth Walton</td>\n",
       "      <td>female</td>\n",
       "      <td>29</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>24160</td>\n",
       "      <td>211.3375</td>\n",
       "      <td>B5</td>\n",
       "      <td>S</td>\n",
       "      <td>2</td>\n",
       "      <td>?</td>\n",
       "      <td>St Louis, MO</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Allison, Master. Hudson Trevor</td>\n",
       "      <td>male</td>\n",
       "      <td>0.9167</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>113781</td>\n",
       "      <td>151.55</td>\n",
       "      <td>C22 C26</td>\n",
       "      <td>S</td>\n",
       "      <td>11</td>\n",
       "      <td>?</td>\n",
       "      <td>Montreal, PQ / Chesterville, ON</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>Allison, Miss. Helen Loraine</td>\n",
       "      <td>female</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>113781</td>\n",
       "      <td>151.55</td>\n",
       "      <td>C22 C26</td>\n",
       "      <td>S</td>\n",
       "      <td>?</td>\n",
       "      <td>?</td>\n",
       "      <td>Montreal, PQ / Chesterville, ON</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>Allison, Mr. Hudson Joshua Creighton</td>\n",
       "      <td>male</td>\n",
       "      <td>30</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>113781</td>\n",
       "      <td>151.55</td>\n",
       "      <td>C22 C26</td>\n",
       "      <td>S</td>\n",
       "      <td>?</td>\n",
       "      <td>135</td>\n",
       "      <td>Montreal, PQ / Chesterville, ON</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>Allison, Mrs. Hudson J C (Bessie Waldo Daniels)</td>\n",
       "      <td>female</td>\n",
       "      <td>25</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>113781</td>\n",
       "      <td>151.55</td>\n",
       "      <td>C22 C26</td>\n",
       "      <td>S</td>\n",
       "      <td>?</td>\n",
       "      <td>?</td>\n",
       "      <td>Montreal, PQ / Chesterville, ON</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   pclass  survived                                             name     sex  \\\n",
       "0       1         1                    Allen, Miss. Elisabeth Walton  female   \n",
       "1       1         1                   Allison, Master. Hudson Trevor    male   \n",
       "2       1         0                     Allison, Miss. Helen Loraine  female   \n",
       "3       1         0             Allison, Mr. Hudson Joshua Creighton    male   \n",
       "4       1         0  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)  female   \n",
       "\n",
       "      age  sibsp  parch  ticket      fare    cabin embarked boat body  \\\n",
       "0      29      0      0   24160  211.3375       B5        S    2    ?   \n",
       "1  0.9167      1      2  113781    151.55  C22 C26        S   11    ?   \n",
       "2       2      1      2  113781    151.55  C22 C26        S    ?    ?   \n",
       "3      30      1      2  113781    151.55  C22 C26        S    ?  135   \n",
       "4      25      1      2  113781    151.55  C22 C26        S    ?    ?   \n",
       "\n",
       "                         home.dest  \n",
       "0                     St Louis, MO  \n",
       "1  Montreal, PQ / Chesterville, ON  \n",
       "2  Montreal, PQ / Chesterville, ON  \n",
       "3  Montreal, PQ / Chesterville, ON  \n",
       "4  Montreal, PQ / Chesterville, ON  "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# load the data - it is available open source and online\n",
    "\n",
    "data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')\n",
    "\n",
    "# display data\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# replace interrogation marks by NaN values\n",
    "\n",
    "data = data.replace('?', np.nan)"
   ]
  },
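   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "A possible alternative, shown only as a sketch on a tiny made-up CSV (the `csv_text` sample and the `sample` name are assumptions, not the Titanic file): `pd.read_csv` accepts an `na_values` argument, so `?` can be parsed as `NaN` while loading. Numeric columns that contain `?` are then read as floats straight away."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
     "import io\n",
     "import pandas as pd\n",
     "\n",
     "# hypothetical two-row sample, not the real data set\n",
     "csv_text = '''age,fare\n",
     "29,211.3375\n",
     "?,151.55\n",
     "'''\n",
     "\n",
     "# treat '?' as a missing value at load time\n",
     "sample = pd.read_csv(io.StringIO(csv_text), na_values='?')\n",
     "\n",
     "# both columns are parsed as floats; 'age' has one missing value\n",
     "print(sample.dtypes)\n",
     "print(sample.isnull().sum())"
    ]
   },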
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# retain only the first cabin if more than\n",
    "# 1 are available per passenger\n",
    "\n",
    "def get_first_cabin(row):\n",
    "    try:\n",
    "        return row.split()[0]\n",
    "    except:\n",
    "        return np.nan\n",
    "    \n",
    "data['cabin'] = data['cabin'].apply(get_first_cabin)"
   ]
  },
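   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "The same idea can be sketched with the vectorised pandas string methods instead of `apply`. The `cabins` series below is a hypothetical three-value sample; the values are taken from the rows displayed above plus one missing entry. Missing values pass through `.str` unchanged, so no `try`/`except` is needed in this variant."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
     "import numpy as np\n",
     "import pandas as pd\n",
     "\n",
     "# hypothetical sample: one cabin, two cabins, no cabin\n",
     "cabins = pd.Series(['B5', 'C22 C26', np.nan])\n",
     "\n",
     "# split on whitespace and keep the first element of each list\n",
     "first_cabin = cabins.str.split().str[0]\n",
     "\n",
     "print(first_cabin.tolist())"
    ]
   },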
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# extracts the title (Mr, Ms, etc) from the name variable\n",
    "\n",
    "def get_title(passenger):\n",
    "    line = passenger\n",
    "    if re.search('Mrs', line):\n",
    "        return 'Mrs'\n",
    "    elif re.search('Mr', line):\n",
    "        return 'Mr'\n",
    "    elif re.search('Miss', line):\n",
    "        return 'Miss'\n",
    "    elif re.search('Master', line):\n",
    "        return 'Master'\n",
    "    else:\n",
    "        return 'Other'\n",
    "    \n",
    "data['title'] = data['name'].apply(get_title)"
   ]
  },
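   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "As a rough sketch of a different approach (not the method used in this notebook), the title can also be captured with a single regular expression that takes the text between the comma and the first full stop. The `names` series reuses names from the rows displayed earlier; the `common` list is an assumption chosen to mirror the categories of `get_title`. The results may differ from `get_title` for unusual names, so treat this as an illustration only."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
     "import pandas as pd\n",
     "\n",
     "# names copied from the first rows displayed above\n",
     "names = pd.Series([\n",
     "    'Allen, Miss. Elisabeth Walton',\n",
     "    'Allison, Master. Hudson Trevor',\n",
     "    'Allison, Mr. Hudson Joshua Creighton',\n",
     "    'Allison, Mrs. Hudson J C (Bessie Waldo Daniels)',\n",
     "])\n",
     "\n",
     "# capture the text between the comma and the first full stop\n",
     "titles = names.str.extract(r',\\s*([^.]+)\\.', expand=False)\n",
     "\n",
     "# keep the four common titles and group everything else as 'Other'\n",
     "common = ['Mr', 'Mrs', 'Miss', 'Master']\n",
     "titles = titles.where(titles.isin(common), 'Other')\n",
     "\n",
     "print(titles.tolist())"
    ]
   },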
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# cast numerical variables as floats\n",
    "\n",
    "data['fare'] = data['fare'].astype('float')\n",
    "data['age'] = data['age'].astype('float')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pclass</th>\n",
       "      <th>survived</th>\n",
       "      <th>sex</th>\n",
       "      <th>age</th>\n",
       "      <th>sibsp</th>\n",
       "      <th>parch</th>\n",
       "      <th>fare</th>\n",
       "      <th>cabin</th>\n",
       "      <th>embarked</th>\n",
       "      <th>title</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>female</td>\n",
       "      <td>29.0000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>211.3375</td>\n",
       "      <td>B5</td>\n",
       "      <td>S</td>\n",
       "      <td>Miss</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>male</td>\n",
       "      <td>0.9167</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>151.5500</td>\n",
       "      <td>C22</td>\n",
       "      <td>S</td>\n",
       "      <td>Master</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>female</td>\n",
       "      <td>2.0000</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>151.5500</td>\n",
       "      <td>C22</td>\n",
       "      <td>S</td>\n",
       "      <td>Miss</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>male</td>\n",
       "      <td>30.0000</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>151.5500</td>\n",
       "      <td>C22</td>\n",
       "      <td>S</td>\n",
       "      <td>Mr</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>female</td>\n",
       "      <td>25.0000</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>151.5500</td>\n",
       "      <td>C22</td>\n",
       "      <td>S</td>\n",
       "      <td>Mrs</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   pclass  survived     sex      age  sibsp  parch      fare cabin embarked  \\\n",
       "0       1         1  female  29.0000      0      0  211.3375    B5        S   \n",
       "1       1         1    male   0.9167      1      2  151.5500   C22        S   \n",
       "2       1         0  female   2.0000      1      2  151.5500   C22        S   \n",
       "3       1         0    male  30.0000      1      2  151.5500   C22        S   \n",
       "4       1         0  female  25.0000      1      2  151.5500   C22        S   \n",
       "\n",
       "    title  \n",
       "0    Miss  \n",
       "1  Master  \n",
       "2    Miss  \n",
       "3      Mr  \n",
       "4     Mrs  "
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# drop unnecessary variables\n",
    "\n",
    "data.drop(labels=['name','ticket', 'boat', 'body','home.dest'], axis=1, inplace=True)\n",
    "\n",
    "# display data\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# save the data set\n",
    "\n",
    "data.to_csv('titanic.csv', index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Exploration\n",
    "\n",
    "### Find numerical and categorical variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "target = 'survived'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of numerical variables: 5\n",
      "Number of categorical variables: 4\n"
     ]
    }
   ],
   "source": [
    "vars_num = [c for c in data.columns if data[c].dtypes!='O' and c!=target]\n",
    "\n",
    "vars_cat = [c for c in data.columns if data[c].dtypes=='O']\n",
    "\n",
    "print('Number of numerical variables: {}'.format(len(vars_num)))\n",
    "print('Number of categorical variables: {}'.format(len(vars_cat)))"
   ]
  },
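   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "The split above relies on comparing each dtype with `'O'` (object). A hedged alternative is `DataFrame.select_dtypes`, which selects columns by kind of dtype and does not depend on strings being stored as `object`. The miniature `sample` frame and the names `num_cols` and `cat_cols` are made up for illustration."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
     "import pandas as pd\n",
     "\n",
     "# hypothetical miniature frame with the same kinds of columns\n",
     "sample = pd.DataFrame({\n",
     "    'survived': [1, 0],\n",
     "    'age': [29.0, 30.0],\n",
     "    'sex': ['female', 'male'],\n",
     "})\n",
     "target = 'survived'\n",
     "\n",
     "# numeric columns, leaving out the target\n",
     "num_cols = [c for c in sample.select_dtypes(include='number').columns if c != target]\n",
     "\n",
     "# everything that is not numeric is treated as categorical\n",
     "cat_cols = list(sample.select_dtypes(exclude='number').columns)\n",
     "\n",
     "print(num_cols)\n",
     "print(cat_cols)"
    ]
   },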
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Find missing values in variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pclass    0.000000\n",
       "age       0.200917\n",
       "sibsp     0.000000\n",
       "parch     0.000000\n",
       "fare      0.000764\n",
       "dtype: float64"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# first in numerical variables\n",
    "\n",
    "data[vars_num].isnull().mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "sex         0.000000\n",
       "cabin       0.774637\n",
       "embarked    0.001528\n",
       "title       0.000000\n",
       "dtype: float64"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# now in categorical variables\n",
    "\n",
    "data[vars_cat].isnull().mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Determine cardinality of categorical variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "sex           2\n",
       "cabin       181\n",
       "embarked      3\n",
       "title         5\n",
       "dtype: int64"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data[vars_cat].nunique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Determine the distribution of numerical variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAlYAAAJOCAYAAAB1IEnpAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAAsTAAALEwEAmpwYAAA+tElEQVR4nO3dfZhmdX3n+fdnaXkQlObB1GJ3xyaB0SEyKqlBXLKmoDMJAivsXsjiEkWC2zuzaDCSlcaZHZPsOIPXBhHNrJkeUXBCRIJm4FLHyAA1xpmFhFbGFlrXDjbSnYYWgdb2ueN3/7h/rUXbXXV396m6H+r9uq666jz8zn1/f32qfv2pc859TqoKSZIkHbj/ZtAFSJIkjQuDlSRJUkcMVpIkSR0xWEmSJHXEYCVJktQRg5UkSVJHDFaaF0mmk7xx0HVIkrSQDFaSJEkdMVhJkiR1xGClOSXZlOTqJA8leSrJh5Ic2tadl+SBJN9K8jdJztrD9r+Y5O4k30zyRJKbkyydsf6qJFuSfDvJV5KsastPTXJ/e+3Hk7x7wTotSbtJsqaNc99u4+H/2JYflOTaNr59LcmbklSSJW39kUluSLK1jXX/IslBg+2N5ovBSv26GPgN4BeBvwf8sySnAh8G/g9gKfBKYNMetg3wr4DnA38fWAH8HkCSFwJvAv5hVT2nvceu17geuL6qntve99bOeyVJ/fsb4L8HjgR+H/iTJMcB/yvwKuClwCnA+bttdyOwEzgBeBnw64DXoI6pJYMuQCPjj6rqUYAk7wTeBzwP+GBV3dnabNnThlW1EdjYZr/Rjjy9o83/HXAIcFKSb1TVphmb/gg4IcmxVfUEcG+XHZKkfVFVfzZj9qNJrgZOBS6k90fgZoAk1wC7jrxPAGcDS6vqe8B3klwHrAb+zULWr4XhESv169EZ04/QO/q0gt5fcLNKMpHklnYI/FvAnwDHwk9C11voHcHa1to9v216Gb2jY19O8tdJzu2qM5K0r5K8vl368HSSp4EX0xvLns8zx8iZ0y8AngVsnbHdvwF+bmGq1kIzWKlfK2ZM/zzwt/QGj1/sY9t/CRRwcjut95v0Tg8CUFV/WlW/Qm8AKuBdbflXq+q19AagdwG3JTm8g75I0j5J8gLg39K7dOGYqloKfIneWLYVWD6j+czx8lHgB8CxVbW0fT23qn5pYSrXQjNYqV+XJ1me5GjgnwIfBW4ALk2yKsl/k2RZkhftYdvnADuA7UmW0bsmC+hdY5XkzCSHAN8Hvgf8uK37zSTPq6ofA0+3TX48Xx2UpFkcTu8Pv28AJLmU3hEr6F3/eUUbA5cCV+3aqKq2Ap8Brk3y3DZW/mKSX13Q6rVgDFbq15/SGxwepnf6719U1V8BlwLXAduB/0TvqNPufp/eBZ3bgU8CH5+x7hDgGuAJ4DF6R6eubuvOAh5MsoPehewXtWsUJGlBVdVDwLXA/ws8DpwM/Oe2+t/SGx+/CHwB+BS9i9X/rq1/PXAw8BDwFHAbcNxC1a6FlaoadA0ackk2AW+sqv846FokadgleRXwx1W1pz80NeY8YiVJ0gFIcliSs5MsaZc7vAP480HXpcEwWEmSdGBC75KHp+idCtwA/POBVqSB8VSgJElSRzxiJUmS1JGhuPP6scceWytXruyr7Xe+8x0OP3w8b2Vk30bXOPdvX/q2bt26J6rqefNc0shyrOsZ577BePfPvvXMNtYNRbBauXIl999/f19tp6enmZqamt+CBsS+ja5x7t++9C3JI/NbzWhzrOsZ577BePfPvvXMNtZ5KlCSmiQfTLItyZdmLPu/k3w5yReT/Hm7AeSudVcn2ZjkK0l+YyBFSxoqBitJ+qkb6d2YdqY7gRdX1T8A/j/aDWyTnARcBPxS2+b/SXLQwpUqaRgZrCSpqarPAk/utuwzVbWzzd7LT58Jdx5wS1X9oKq+BmwETl2wYiUNpaG4xkqSRsRv0XtOJsAyekFr
l81t2TMkWQ2sBpiYmGB6erqvN9qxY0ffbUfNOPcNxrt/9m1uBitJ6kOSf0rv+W8378t2VbUWWAswOTlZ/V4c60XCo2uc+2ff5mawkha5lWs+Oev6G88az49W74skbwDOBVbVT++qvAVYMaPZ8rZM+2iun8FN15yzQJVIB85rrCRpFknOAt4GvLqqvjtj1R3ARUkOSXI8cCLwV4OoUdLw8IiVJDVJPgJMAccm2UzvYbpXA4cAdyYBuLeq/nFVPZjkVuAheqcIL6+qvxtM5ZKGhcFKkpqqeu0eFt8wS/t3Au+cv4okjRpPBUqSJHXEYCVJktQRg5UkSVJHDFaSJEkdMVhJkiR1xGAlSZLUEYOVJElSRwxWkiRJHTFYSZIkdcRgJUmS1BGDlSRJUkcMVpIkSR0xWEmSJHXEYCVJktQRg5UkSVJH+gpWSZYmuS3Jl5NsSPKKJEcnuTPJV9v3o1rbJHlvko1JvpjklPntgiR1I8kHk2xL8qUZyxzrJPWt3yNW1wOfrqoXAS8BNgBrgLuq6kTgrjYP8CrgxPa1Gnh/pxVL0vy5EThrt2WOdZL6NmewSnIk8ErgBoCq+mFVPQ2cB9zUmt0EnN+mzwM+XD33AkuTHNdx3ZLUuar6LPDkbosd6yT1LVU1e4PkpcBa4CF6R6vWAVcAW6pqaWsT4KmqWprkE8A1VfW5tu4u4Kqqun+3111N7688JiYmfvmWW27pq+AdO3ZwxBFH9Nu/kWLfRtco92/9lu2zrj/+yIP67tsZZ5yxrqomu6hrUJKsBD5RVS9u80871nVr977N9TN48rIj57ukTi2mfTdO9qVvs411S/rYfglwCvDmqrovyfX89FA4AFVVSWZPaLupqrX0AhuTk5M1NTXV13bT09P023bU2LfRNcr9e8OaT866/sazDh/ZvnXNsa4bu/dtrp/BTRdPzbp+2CymfTdOuupbP9dYbQY2V9V9bf42ekHr8V2Hvdv3bW39FmDFjO2Xt2WSNIoc6yT1bc5gVVWPAY8meWFbtIreacE7gEvaskuA29v0HcDr2ydmTgO2V9XWbsuWpAXjWCepb/2cCgR4M3BzkoOBh4FL6YWyW5NcBjwCXNjafgo4G9gIfLe1laShl+QjwBRwbJLNwDuAa3Csk9SnvoJVVT0A7OkirVV7aFvA5QdWliQtvKp67V5WOdZJ6ku/R6wkSRo6K2e58H3TNecsYCVSj4+0kSRJ6ohHrCRJ827mkaUrT9455y0WpFHlEStJkqSOGKwkSZI6YrCSJEnqiMFKkiSpIwYrSZKkjvipQEnSWJrtHlfgfa40PzxiJUmS1BGDlSRJUkcMVpIkSR0xWEmSJHXEYCVJktQRg5UkSVJHDFaSJEkdMVhJUh+S/E6SB5N8KclHkhya5Pgk9yXZmOSjSQ4edJ2SBstgJUlzSLIM+G1gsqpeDBwEXAS8C7iuqk4AngIuG1yVkoaBwUqS+rMEOCzJEuDZwFbgTOC2tv4m4PzBlCZpWPhIG0maQ1VtSfKHwNeB7wGfAdYBT1fVztZsM7Bs922TrAZWA0xMTDA9Pd3Xe+7YsaPvtqPgypN3/mR64rBnzs9ltn+HfXmdfXndAzFu+24m+zY3g5UkzSHJUcB5wPHA08CfAWf1s21VrQXWAkxOTtbU1FRf7zk9PU2/bUfBG2Y8t+/Kk3dy7fr+//vZdPFUX6+7r2Z73QMxbvtuJvs2N08FStLcfg34WlV9o6p+BHwcOB1Y2k4NAiwHtgyqQEnDwWAlSXP7OnBakmcnCbAKeAi4B7igtbkEuH1A9UkaEgYrSZpDVd1H7yL1zwPr6Y2da4GrgLcm2QgcA9wwsCIlDQWvsZKkPlTVO4B37Lb4YeDUAZQjaUh5xEqSJKkjBitJkqSOGKwkSZI6YrCSJEnqiMFKkiSpIwYrSZKkjhisJEmSOtJ3sEpyUJIvJPlEmz8+yX1JNib5aJKD2/JD2vzGtn7lPNUuSZI0VPbliNUVwIYZ8+8Crquq
E4CngMva8suAp9ry61o7SZKksddXsEqyHDgH+ECbD3AmvUc8ANwEnN+mz2vztPWrWntJkqSx1u8jbd4DvA14Tps/Bni6qna2+c3Asja9DHgUoKp2Jtne2j8x8wWTrAZWA0xMTDA9Pd1XITt27Oi77aixb6NrlPt35ck7Z10/yn2TpIU2Z7BKci6wrarWJZnq6o2rai29h5gyOTlZU1P9vfT09DT9th019m10jXL/3rDmk7Ouv/Gsw0e2bxoPK+f4GZWGST9HrE4HXp3kbOBQ4LnA9cDSJEvaUavlwJbWfguwAticZAlwJPDNziuXJGmezBbmNl1zzgJWolEz5zVWVXV1VS2vqpXARcDdVXUxcA9wQWt2CXB7m76jzdPW311V1WnVkiRJQ+hA7mN1FfDWJBvpXUN1Q1t+A3BMW/5WYM2BlShJkjQa+r14HYCqmgam2/TDwKl7aPN94DUd1CZJQyPJUnqfjH4xUMBvAV8BPgqsBDYBF1bVU4OpUNIw2KdgJUmL2PXAp6vqgnZD5GcDbwfuqqprkqyhd4T+qkEWOUheZC75SBtJmlOSI4FX0i55qKofVtXTPPO+fTPv5ydpkfKIlSTN7XjgG8CHkrwEWEfvaRQTVbW1tXkMmNh9w8V0z7657om2y8Rh/bedT7P9+85W31z7ZRT3Xb/s29wMVpI0tyXAKcCbq+q+JNez2wdzqqqS/MwnoBfTPfvmuifaLleevJNr1w/+v59NF0/tdd1sfZltOxjNfdcv+zY3TwVK0tw2A5ur6r42fxu9oPV4kuMA2vdtA6pP0pAwWEnSHKrqMeDRJC9si1YBD/HM+/bNvJ+fpEVq8MdiJWk0vBm4uX0i8GHgUnp/nN6a5DLgEeDCAdYnaQgYrCSpD1X1ADC5h1WrFrgUSUPMU4GSJEkdMVhJkiR1ZOROBa7fsn2vH4P1ieOSJGmQRi5YSZI0SHM9uufGsw5foEo0jDwVKEmS1BGDlSRJUkcMVpIkSR0xWEmSJHXEYCVJktQRg5UkSVJHDFaSJEkd8T5WWhCz3dgVvLmrJGk8eMRKkiSpIwYrSZKkjhisJEmSOuI1VpLUhyQHAfcDW6rq3CTHA7cAxwDrgNdV1Q8HWaP2zVzP/JP2h0esJKk/VwAbZsy/C7iuqk4AngIuG0hVkoaKwUqS5pBkOXAO8IE2H+BM4LbW5Cbg/IEUJ2moeCpQkub2HuBtwHPa/DHA01W1s81vBpbtacMkq4HVABMTE0xPT/f1hjt27Oi77bC48uSdczcCJg7rv+0omm3frd+yfa/bnbzsyHmqqDuj+HPZr676ZrCSpFkkORfYVlXrkkzt6/ZVtRZYCzA5OVlTU/29xPT0NP22HRaz3atupitP3sm168f3v58bzzp8r/tu1vv5XbznbYbJKP5c9qurvo3vT7YkdeN04NVJzgYOBZ4LXA8sTbKkHbVaDmwZYI2ShoTXWEnSLKrq6qpaXlUrgYuAu6vqYuAe4ILW7BLg9gGVKGmIGKwkaf9cBbw1yUZ611zdMOB6JA0BTwVKUp+qahqYbtMPA6cOsh6Nl7nuq+UzVUfDnEeskqxIck+Sh5I8mOSKtvzoJHcm+Wr7flRbniTvTbIxyReTnDLfnZAkSRoG/Ryx2glcWVWfT/IcYF2SO4E3AHdV1TVJ1gBr6B0afxVwYvt6OfD+9l2SNMK8U7k0tzmDVVVtBba26W8n2UDvfi3nAVOt2U30Do9f1ZZ/uKoKuDfJ0iTHtdeRJPVh/Zbts38039NC0lDap2uskqwEXgbcB0zMCEuPARNtehnw6IzNdt047xnBan9vmjfbjeVG/aZl43zjtbluCDjq/R7lfTfXjRpHuW+StND6DlZJjgA+Brylqr7Ve6JDT1VVktqXN97fm+a97+bb93pjuVG4udpsxvnGa7PtN3DfDdJcN3Wc7WaHkn7WXEcbNd76ut1CkmfRC1U3V9XH2+LHkxzX1h8HbGvLtwArZmzujfMkSdKi0M+nAkPv
/iwbqurdM1bdQe+mePDMm+PdAby+fTrwNGC711dJkqTFoJ9TgacDrwPWJ3mgLXs7cA1wa5LLgEeAC9u6TwFnAxuB7wKXdlmwJEnSsOrnU4GfA7KX1av20L6Ayw+wLkmSpJHjI20kSZI6YrCSJEnqiMFKkiSpIwYrSZKkjhisJEmSOrJPj7SRJEmDMdtDsH125PAwWEmSNOYMZQvHYCVJc0iyAvgwvYfNF7C2qq5PcjTwUWAlsAm4sKqeWoia/I9SGk5eYyVJc9sJXFlVJwGnAZcnOQlYA9xVVScCd7V5SYuYwUqS5lBVW6vq823628AGYBlwHnBTa3YTcP5ACpQ0NDwVKEn7IMlK4GXAfcDEjIfMP0bvVOHu7VcDqwEmJiaYnp7u630mDoMrT965XzX2+x77an/r2d2B9G0UDKJ/c+3z2erZl5+XHTt2zNvP16B11TeDlST1KckRwMeAt1TVt5KfPka1qipJ7b5NVa0F1gJMTk7W1NRUX+/1vptv59r1+zdEb7q4v/fYV2+Y5bqufXHlyTv3u2+jYBD9m2ufz7bv9uXnZXp6mn5/hkdNV30b359sSepQkmfRC1U3V9XH2+LHkxxXVVuTHAdsG1yFWsxm+zCDFpbXWEnSHNI7NHUDsKGq3j1j1R3AJW36EuD2ha5N0nDxiJUkze104HXA+iQPtGVvB64Bbk1yGfAIcOFgypM0LAxWkjSHqvockL2sXrWQtfRjrtNC3udK/dr9Z+nKk3c+43otf5Z+lqcCJUmSOuIRK0laZLxru7riz9LPMlhJkn7CT5dJB8ZTgZIkSR0xWEmSJHXEU4GSJC1inv7tlkesJEmSOmKwkiRJ6ojBSpIkqSMGK0mSpI4YrCRJkjpisJIkSeqIt1uQJElDZZQflWOwkiRJnRvlcHQgPBUoSZLUkXkJVknOSvKVJBuTrJmP95CkYeB4J2mmzoNVkoOAfw28CjgJeG2Sk7p+H0kaNMc7Sbubj2usTgU2VtXDAEluAc4DHpqH95KkQXK8k/bDgTyfcL6ebXjjWYd38jqpqk5e6CcvmFwAnFVVb2zzrwNeXlVv2q3damB1m30h8JU+3+JY4ImOyh029m10jXP/9qVvL6iq581nMcOkn/HOsW6PxrlvMN79s289ex3rBvapwKpaC6zd1+2S3F9Vk/NQ0sDZt9E1zv0b574tBMe6nzXOfYPx7p99m9t8XLy+BVgxY355WyZJ48bxTtIzzEew+mvgxCTHJzkYuAi4Yx7eR5IGzfFO0jN0fiqwqnYmeRPwF8BBwAer6sEO32KfD6mPEPs2usa5f+PctwMyz+PdOP+7j3PfYLz7Z9/m0PnF65IkSYuVd16XJEnqiMFKkiSpI0MZrJJ8MMm2JF/ay/okeW97hMQXk5yy0DXurz76NpVke5IH2tc/X+ga91eSFUnuSfJQkgeTXLGHNiO57/rs2yjvu0OT/FWS/9r69/t7aHNIko+2fXdfkpUDKHVRGKfH5OztdyfJ0UnuTPLV9v2oQde6v5IclOQLST7R5o9vvyMb2+/MwYOucX8kWZrktiRfTrIhySvGZb8l+Z328/ilJB9pY2An+20ogxVwI3DWLOtfBZzYvlYD71+AmrpyI7P3DeAvq+ql7esPFqCmruwErqyqk4DTgMv38HiPUd13/fQNRnff/QA4s6peArwUOCvJabu1uQx4qqpOAK4D3rWwJS4OGb/H5Oztd2cNcFdVnQjc1eZH1RXAhhnz7wKua78rT9H73RlF1wOfrqoXAS+h18eR329JlgG/DUxW1YvpffDkIjrab0MZrKrqs8CTszQ5D/hw9dwLLE1y3MJUd2D66NvIqqqtVfX5Nv1ter+Ey3ZrNpL7rs++jay2P3a02We1r90/2XIecFObvg1YlSQLVOJi8pPH5FTVD4Fdj8kZSbP87sz8eboJOH8gBR6gJMuBc4APtPkAZ9L7HYER7VuSI4FXAjcAVNUPq+ppxmS/0bsrwmFJlgDPBrbS0X4b
ymDVh2XAozPmNzNG/8kBr2inZP5Dkl8adDH7o50mehlw326rRn7fzdI3GOF9105nPABsA+6sqr3uu6raCWwHjlnQIheHkf8d2ZvdfncmqmprW/UYMDGoug7Qe4C3AT9u88cAT7ffERjd/Xc88A3gQ+005weSHM4Y7Leq2gL8IfB1eoFqO7COjvbbqAarcfZ5es8gegnwPuDfD7acfZfkCOBjwFuq6luDrqdLc/RtpPddVf1dVb2U3t3DT03y4gGXpDEy2+9O9e77M3L3/klyLrCtqtYNupZ5sAQ4BXh/Vb0M+A67nfYb4f12FL0jb8cDzwcOZ+5LdPo2qsFqbB8jUVXf2nVKpqo+BTwrybEDLqtvSZ5Fb/C8uao+vocmI7vv5urbqO+7Xdrh/nv42YHmJ/uuHT4/Evjmgha3OIzs78je7OV35/FdlwG079sGVd8BOB14dZJN9E7ZnknvuqSl7XcERnf/bQY2zzhyfRu9oDUO++3XgK9V1Teq6kfAx+nty07226gGqzuA17dPmJ0GbJ9xaHKkJflvd123kuRUevtoJP7zanXfAGyoqnfvpdlI7rt++jbi++55SZa26cOAfwR8ebdmdwCXtOkLgLvLOwzPh7F6TM4svzszf54uAW5f6NoOVFVdXVXLq2olvf10d1VdTO8Pkwtas1Ht22PAo0le2BatAh5iDPYbvVOApyV5dvv53NW3TvbbUN55PclHgCngWOBx4B30Lqalqv64/UP8Eb2/qL8LXFpV9w+m2n3TR9/eBPwTep+k+R7w1qr6L4Opdt8k+RXgL4H1/PR6g7cDPw+jve/67Nso77t/QO9izYPoBcJbq+oPkvwBcH9V3ZHkUODf0btG5kngoqp6eGBFj7EkZ9O7dmfXY3LeOdiK9t8svzv3AbfS+x16BLiwqkb2gz1JpoDfrapzk/wCvSNYRwNfAH6zqn4wwPL2S5KX0rso/2DgYeBS2vjAiO+39G4p8z/TG6+/ALyR3jVVB7zfhjJYSZIkjaJRPRUoSZI0dAxWkiRJHTFYSZIkdcRgJUmS1BGDlSRJUkcMVpIkSR0xWEmSJHXEYCVJktQRg5UkSVJHDFaSJEkdMVhJkiR1xGAlSZLUEYOVJElSRwxWOmBJ3p7kA216ZZJKsmTQdUnSsErye0n+ZNB1qHv+56cDVlX/ctA1SJI0DDxiJUlSx9Lj/7GLkDtd+yTJVUm2JPl2kq8kWbWXQ9q/leRvk2xN8rsztj81yf1JvpXk8STvbst3nUJcvaftJGmhJNmU5OokDyV5KsmHkhya5Kgkn0jyjbb8E0mWz9huOsk7k/xn4LvALyT5pSR3JnmyjXlvn/FWByf5cBtPH0wyueCdVecMVupbkhcCbwL+YVU9B/gNYNNemp8BnAj8OnBVkl9ry68Hrq+q5wK/CNza53aStJAupjfG/SLw94B/Ru//zA8BLwB+Hvge8Ee7bfc6YDXwHOBx4D8CnwaeD5wA3DWj7auBW4ClwB17eC2NIIOV9sXfAYcAJyV5VlVtqqq/2Uvb36+q71TVenoD0Wvb8h8BJyQ5tqp2VNW9fW4nSQvpj6rq0ap6Engn8Nqq+mZVfayqvltV327Lf3W37W6sqgeraidwLvBYVV1bVd+vqm9X1X0z2n6uqj5VVX8H/DvgJQvRMc0vg5X6VlUbgbcAvwdsS3JLkufvpfmjM6YfoffXGsBl9P76+3KSv05ybp/bSdJC+pmxKMmzk/ybJI8k+RbwWWBpkoP2st0KYG9/fAI8NmP6u8ChfqJ69BmstE+q6k+r6lfoHQov4F17abpixvTPA3/btv9qVb0W+Lm27W1JDp9rO0laYHsai64EXgi8vF3O8Mq2PjPa1ozpR4FfmM8iNXwMVupbkhcmOTPJIcD36V1f8OO9NP8/2193vwRcCny0vcZvJnleVf0YeLq1/fFc20nSArs8yfIkRwP/lN5Y9Bx6497Tbfk75niNTwDHJXlLkkOSPCfJy+e3bA2awUr7
4hDgGuAJeoewfw64ei9t/xOwkd6Fmn9YVZ9py88CHkyyg96F7BdV1ff62E6SFtKfAp8BHqZ3Ou9fAO8BDqM3Bt5L76L0vWrXYf0j4H+gN2Z+ld4HdDTGUlVzt5LmWZKVwNeAZ7WLPiVpIJJsAt5YVf9x0LVo9HjESpIkqSMGK0mSpI54KlCSJKkjHrGSJEnqiMFKkiSpI0Nxh9djjz22Vq5c2Vfb73znOxx++OFzNxxB9m10jXP/9qVv69ate6KqnjfPJY2sxTjWjUM/7MNwGKY+zDbWDUWwWrlyJffff39fbaenp5mamprfggbEvo2uce7fvvQtySPzW81oW4xj3Tj0wz4Mh2Hqw2xjnacCJUmSOmKwkiRJ6ojBStKikuSDSbYl+dKMZUcnuTPJV9v3o9ryJHlvko1JvpjklBnbXNLafzXJJYPoi6ThY7CStNjcSO+ZlTOtAe6qqhPpPadyTVv+KuDE9rUaeD/0ghi9B/C+HDgVeMeuMCZpcTNYSVpUquqzwJO7LT4PuKlN3wScP2P5h6vnXmBpkuOA3wDurKonq+op4E5+NqxJWoSG4lOB+2L9lu28Yc0n97hu0zXnLHA1ksbERFVtbdOPARNtehnw6Ix2m9uyvS3/GUlW0zvaxcTEBNPT030VtO3J7bzv5tv3uv7kZUf29TqDtmPHjr77PKzsw3AYlT6MXLCSpPlUVZWks2d9VdVaYC3A5ORk9ftx8ffdfDvXrt/7EL3p4v5eZ9CG6SPy+8s+DIdR6YOnAiUJHm+n+Gjft7XlW4AVM9otb8v2tlzSImewkiS4A9j1yb5LgNtnLH99+3TgacD2dsrwL4BfT3JUu2j919sySYucpwIlLSpJPgJMAccm2Uzv033XALcmuQx4BLiwNf8UcDawEfgucClAVT2Z5P8C/rq1+4Oq2v2CeEmLkMFK0qJSVa/dy6pVe2hbwOV7eZ0PAh/ssDRJY8BTgZIkSR0xWEmSJHXEYCVJktQRg5UkSVJHDFaSJEkd6StYJfmdJA8m+VKSjyQ5NMnxSe5rT33/aJKDW9tD2vzGtn7lvPZAkiRpSMwZrJIsA34bmKyqFwMHARcB7wKuq6oTgKeAy9omlwFPteXXtXaSJEljr99TgUuAw5IsAZ4NbAXOBG5r63d/Gvyup8TfBqxKkk6qlSRJGmJz3iC0qrYk+UPg68D3gM8A64Cnq2pnazbzye4/eep7Ve1Msh04Bnhi5uvu7xPfJw6DK0/eucd1o/DU69mMypO798c49w3Gu3/j3DdJ6tqcwao9B+s84HjgaeDPgLMO9I3n44nvo/K0970ZlSd3749x7huMd//GuW+S1LV+TgX+GvC1qvpGVf0I+DhwOrC0nRqEZz7Z/SdPfW/rjwS+2WnVkiRJQ6ifYPV14LQkz27XSq0CHgLuAS5obXZ/Gvyup8RfANzdnrclSZI01uYMVlV1H72L0D8PrG/brAWuAt6aZCO9a6huaJvcABzTlr8VWDMPdUuSJA2dOa+xAqiqdwDv2G3xw8Cpe2j7feA1B16aJEnSaPHO65IkSR0xWEmSJHXEYCVJktQRg5UkSVJHDFaSJEkdMVhJkiR1xGAlSU2S30nyYJIvJflIkkOTHJ/kviQbk3w0ycGt7SFtfmNbv3LA5UsaAgYrSQKSLAN+G5isqhcDBwEXAe8CrquqE4CngMvaJpcBT7Xl17V2khY5g5Uk/dQS4LD2nNNnA1uBM+k9fQLgJuD8Nn1em6etX9Ue+yVpEevrzuuSNO6qakuSP6T3fNTvAZ8B1gFPV9XO1mwzsKxNLwMebdvuTLKd3uO9npj5uklWA6sBJiYmmJ6e7queicPgypN37nV9v68zaDt27BiZWvfGPgyHUemDwUqSgCRH0TsKdTzwNPBnwFkH+rpVtZbe81WZnJysqampvrZ73823c+36vQ/Rmy7u73UGbXp6mn77PKzsw3AYlT54KlCSen4N+FpVfaOqfgR8HDgdWNpODQIsB7a06S3A
CoC2/kjgmwtbsqRhY7CSpJ6vA6cleXa7VmoV8BBwD3BBa3MJcHubvqPN09bfXVW1gPVKGkIGK0kCquo+ehehfx5YT298XAtcBbw1yUZ611Dd0Da5ATimLX8rsGbBi5Y0dLzGSpKaqnoH8I7dFj8MnLqHtt8HXrMQdUkaHR6xkiRJ6ojBSpIkqSMGK0mSpI4YrCRJkjpisJIkSeqIwUqSJKkjBitJkqSOGKwkSZI6YrCSJEnqSF/BKsnSJLcl+XKSDUlekeToJHcm+Wr7flRrmyTvTbIxyReTnDK/XZAkSRoO/R6xuh74dFW9CHgJsIHec7HuqqoTgbv46XOyXgWc2L5WA+/vtGJJkqQhNWewSnIk8Erag0er6odV9TRwHnBTa3YTcH6bPg/4cPXcCyxNclzHdUuSJA2dfh7CfDzwDeBDSV4CrAOuACaqamtr8xgw0aaXAY/O2H5zW7Z1xjKSrKZ3RIuJiQmmp6f7KnjiMLjy5J17XNfvawyrHTt2jHwf9mac+wbj3b9x7pskda2fYLUEOAV4c1Xdl+R6fnraD4CqqiS1L29cVWuBtQCTk5M1NTXV13bvu/l2rl2/57I3Xdzfawyr6elp+v13GDXj3DcY7/6Nc98kqWv9XGO1GdhcVfe1+dvoBa3Hd53ia9+3tfVbgBUztl/elkmSJI21OYNVVT0GPJrkhW3RKuAh4A7gkrbsEuD2Nn0H8Pr26cDTgO0zThlKkiSNrX5OBQK8Gbg5ycHAw8Cl9ELZrUkuAx4BLmxtPwWcDWwEvtvaSpIkjb2+glVVPQBM7mHVqj20LeDyAytLkiRp9HjndUmSpI4YrCSp8SkTkg6UwUqSfsqnTEg6IAYrScKnTEjqRr+fCpSkcTcyT5mA0XnSxDjcud8+DIdR6YPBSpJ6RuYpEzA6T5oYhzv324fhMCp98FSgJPX4lAlJB8xgJUn4lAlJ3fBUoCT9lE+ZkHRADFaS1PiUCUkHylOBkiRJHTFYSZIkdcRgJUmS1BGDlSRJUkcMVpIkSR0xWEmSJHXEYCVJktQRg5UkSVJHDFaSJEkdMVhJkiR1xGAlSZLUEYOVJElSRwxWkiRJHek7WCU5KMkXknyizR+f5L4kG5N8NMnBbfkhbX5jW79ynmqXJEkaKvtyxOoKYMOM+XcB11XVCcBTwGVt+WXAU235da2dJEnS2OsrWCVZDpwDfKDNBzgTuK01uQk4v02f1+Zp61e19pIkSWNtSZ/t3gO8DXhOmz8GeLqqdrb5zcCyNr0MeBSgqnYm2d7aPzHzBZOsBlYDTExMMD093VchE4fBlSfv3OO6fl9jWO3YsWPk+7A349w3GO/+jXPfJKlrcwarJOcC26pqXZKprt64qtYCawEmJydraqq/l37fzbdz7fo9l73p4v5eY1hNT0/T77/DqBnnvsF492+c+yZJXevniNXpwKuTnA0cCjwXuB5YmmRJO2q1HNjS2m8BVgCbkywBjgS+2XnlkiRJQ2bOa6yq6uqqWl5VK4GLgLur6mLgHuCC1uwS4PY2fUebp62/u6qq06olSZKG0IHcx+oq4K1JNtK7huqGtvwG4Ji2/K3AmgMrUZIWjreWkXQg9ilYVdV0VZ3bph+uqlOr6oSqek1V/aAt/36bP6Gtf3g+CpekeeKtZSTtN++8LkmNt5aRdKD6vd2CJC0G72EEbi0Do3N7mXG4XYd9GA6j0geDlSQxWreWgdG5vcw43K7DPgyHUemDwUqSery1jKQD5jVWkoS3lpHUDYOVJM3OW8tI6punAiVpN1U1DUy36YeBU/fQ5vvAaxa0MElDz2DVrFzzyVnXb7rmnAWqRJIkjSpPBUqSJHXEYCVJktQRg5UkSVJHDFaSJEkdMVhJkiR1xGAlSZLUEYOVJElSRwxWkiRJHTFYSZIkdcRgJUmS1BGDlSRJUkcMVpIkSR0xWEmSJHXEYCVJktQRg5UkSVJH5gxWSVYkuSfJQ0keTHJFW350kjuTfLV9P6otT5L3
JtmY5ItJTpnvTkiSJA2Dfo5Y7QSurKqTgNOAy5OcBKwB7qqqE4G72jzAq4AT29dq4P2dVy1JkjSE5gxWVbW1qj7fpr8NbACWAecBN7VmNwHnt+nzgA9Xz73A0iTHdV24JEnSsFmyL42TrAReBtwHTFTV1rbqMWCiTS8DHp2x2ea2bOuMZSRZTe+IFhMTE0xPT/dVw8RhcOXJO/e4rt/X2JO9vWYXr92vHTt2LMj7DMI49w3Gu3/j3DdJ6lrfwSrJEcDHgLdU1beS/GRdVVWS2pc3rqq1wFqAycnJmpqa6mu79918O9eu33PZmy7u7zX25A1rPjnr+gN57X5NT0/T77/DqBnnvsF492+c+zZTkhXAh+n9kVjA2qq6PsnRwEeBlcAm4MKqeiq9QfB64Gzgu8Abdh3dl7R49fWpwCTPoheqbq6qj7fFj+86xde+b2vLtwArZmy+vC2TpGHm9aSSDlg/nwoMcAOwoarePWPVHcAlbfoS4PYZy1/fPh14GrB9xilDSRpKXk8qqQv9nAo8HXgdsD7JA23Z24FrgFuTXAY8AlzY1n2K3qHxjfQOj1/aZcGSNN+G/XpSWJjrPrswDtfo2YfhMCp9mDNYVdXngOxl9ao9tC/g8gOsS5IGYhSuJ4WFue6zC+NwjZ59GA6j0gfvvC5JjdeTSjpQBitJwutJJXVjn+5jJUljbNFcT7pyltvLbLrmnAWsRBo/BitJwutJJXXDU4GSJEkdMVhJkiR1xGAlSZLUEYOVJElSRwxWkiRJHTFYSZIkdcTbLUiSOrGn+2NdefJO3rDmk94fS4uGR6wkSZI6YrCSJEnqiMFKkiSpIwYrSZKkjhisJEmSOmKwkiRJ6ojBSpIkqSMGK0mSpI4YrCRJkjrindcHbOadinfdoXgX71QsSdJo8YiVJElSRwxWkiRJHZmXU4FJzgKuBw4CPlBV18zH+2j/7OlBqbvM1+nH9Vu2P+M050K9rzTfHO8kzdT5EaskBwH/GngVcBLw2iQndf0+kjRojneSdjcfR6xOBTZW1cMASW4BzgMemof3knSAZjuCCXDjWYcvUCUjyfFuwAZxBF6azXwEq2XAozPmNwMvn4f3keYMBQ6smmeOd2Nqvj6xPaggONvlGItpnFyIf/9UVScv9JMXTC4AzqqqN7b51wEvr6o37dZuNbC6zb4Q+Eqfb3Es8ERH5Q4b+za6xrl/+9K3F1TV8+azmGHSz3jnWDcW/bAPw2GY+rDXsW4+jlhtAVbMmF/elj1DVa0F1u7riye5v6om97+84WXfRtc492+c+9aBOce7xT7WjUM/7MNwGJU+zMftFv4aODHJ8UkOBi4C7piH95GkQXO8k/QMnR+xqqqdSd4E/AW9jx9/sKoe7Pp9JGnQHO8k7W5e7mNVVZ8CPjUfr81+HFIfIfZtdI1z/8a5bwdsHse7cfl3H4d+2IfhMBJ96PzidUmSpMXKR9pIkiR1ZKSCVZKzknwlycYkawZdT1eSrEhyT5KHkjyY5IpB19S1JAcl+UKSTwy6li4lWZrktiRfTrIhySsGXVNXkvxO+3n8UpKPJDl00DUtFqM+1o3TmDbqY9c4jFGjNhaNTLAa80dH7ASurKqTgNOAy8eob7tcAWwYdBHz4Hrg01X1IuAljEkfkywDfhuYrKoX07sw+6LBVrU4jMlYN05j2qiPXSM9Ro3iWDQywYoZj46oqh8Cux4dMfKqamtVfb5Nf5veD/6ywVbVnSTLgXOADwy6li4lORJ4JXADQFX9sKqeHmhR3VoCHJZkCfBs4G8HXM9iMfJj3biMaaM+do3RGDVSY9EoBas9PTpi5H5R55JkJfAy4L4Bl9Kl9wBvA3484Dq6djzwDeBD7VTBB5KMxYP1qmoL8IfA14GtwPaq+sxgq1o0xmqsG/Ex7T2M9tg18mPUKI5FoxSsxl6SI4CPAW+pqm8Nup4uJDkX2FZV6wZdyzxYApwCvL+qXgZ8Bxi562H2JMlR9I6SHA88Hzg8
yW8OtiqNmlEe08Zk7Br5MWoUx6JRClZ9PSpnVCV5Fr0B6Oaq+vig6+nQ6cCrk2yid0rjzCR/MtiSOrMZ2FxVu/4Sv43eIDYOfg34WlV9o6p+BHwc+O8GXNNiMRZj3RiMaeMwdo3DGDVyY9EoBauxfXREktA7B76hqt496Hq6VFVXV9XyqlpJb5/dXVVD/ddGv6rqMeDRJC9si1YBDw2wpC59HTgtybPbz+cqRuyi1xE28mPdOIxp4zB2jckYNXJj0bzceX0+jPmjI04HXgesT/JAW/b2dkdnDbc3Aze3/wAfBi4dcD2dqKr7ktwGfJ7eJ7y+wIjc9XjUjclY55g2PEZ6jBrFscg7r0uSJHVklE4FSpIkDTWDlSRJUkcMVpIkSR0xWEmSJHXEYCVJktQRg5UkSVJHDFaSJEkdMVhJkiR1xGAlSZLUEYOVJElSRwxWkiRJHTFYSZIkdcRgJUmS1BGDlSRJUkcMVjogSV6Y5IEk307y24OuR5KkQVoy6AI08t4G3FNVLx10IZIkDZpHrHSgXgA8uK8bJTHUS5LGjsFK+y3J3cAZwB8l2ZHkiiRfSPKtJI8m+b0ZbVcmqSSXJfk6cHdb/ltJNiR5KslfJHnBYHojSdKBM1hpv1XVmcBfAm+qqiOA/wq8HlgKnAP8kyTn77bZrwJ/H/iNJOcBbwf+J+B57bU+siDFS5I0DwxW6kxVTVfV+qr6cVV9kV5I+tXdmv1eVX2nqr4H/GPgX1XVhqraCfxL4KUetZIkjSqDlTqT5OVJ7knyjSTb6QWnY3dr9uiM6RcA1yd5OsnTwJNAgGULUrAkSR0zWKlLfwrcAayoqiOBP6YXlGaqGdOPAv9bVS2d8XVYVf2XBapXkqROGazUpecAT1bV95OcCvwvc7T/Y+DqJL8EkOTIJK+Z7yIlSZovBit16X8H/iDJt4F/Dtw6W+Oq+nPgXcAtSb4FfAl41bxXKUnSPElVzd1KkiRJc/KIlSRJUkcMVpIkSR0xWEmSJHXEYCVJktSRoXgQ7rHHHlsrV67sq+13vvMdDj/88PktqCPWOj+sdX50Ueu6deueqKrndVSSJI2coQhWK1eu5P777++r7fT0NFNTU/NbUEesdX5Y6/zootYkj3RTjSSNJk8FSpIkdcRgJUmS1BGDlSRJUkcMVpIkSR0xWEmSJHVkKD4VuC/Wb9nOG9Z8co/rNl1zzgJXI0mS9FMesZIkSeqIwUqSJKkjBitJkqSOGKwkSZI6YrCSJEnqiMFKkiSpIwYrSZKkjhisJEmSOmKwkiRJ6ojBSpIkqSMGK0mSpI4YrCRJkjpisJIkSeqIwUqSJKkjBitJkqSOGKwkSZI6YrCSJEnqiMFKkiSpIwYrSZKkjhisJEmSOmKwkiRJ6ojBSpIkqSN9BaskS5PcluTLSTYkeUWSo5PcmeSr7ftRrW2SvDfJxiRfTHLK/HZBkiRpOPR7xOp64NNV9SLgJcAGYA1wV1WdCNzV5gFeBZzYvlYD7++0YkmSpCE1Z7BKciTwSuAGgKr6YVU9DZwH3NSa3QSc36bPAz5cPfcCS5Mc13HdkiRJQydVNXuD5KXAWuAheker1gFXAFuqamlrE+Cpqlqa5BPANVX1ubbuLuCqqrp/t9ddTe+IFhMTE798yy239FXwtie38/j39rzu5GVH9vUaC2XHjh0cccQRgy6jL9Y6PxZbrWeccca6qprsqCRJGjlL+mxzCvDmqrovyfX89LQfAFVVSWZPaLupqrX0AhuTk5M1NTXV13bvu/l2rl2/57I3XdzfayyU6elp+u3XoFnr/LBWSVpc+rnGajOwuarua/O30Qtaj+86xde+b2vrtwArZmy/vC2TJEkaa3MGq6p6DHg0yQvbolX0TgveAVzSll0C3N6m7wBe3z4deBqwvaq2dlu2JEnS8OnnVCDAm4GbkxwMPAxcSi+U3ZrkMuAR4MLW9lPA2cBG4LutrSRJ0tjrK1hV1QPAni5IXbWHtgVcfmBlSZIk
jR7vvC5JktQRg5UkSVJHDFaSJEkdMVhJkiR1xGAlSZLUEYOVJElSRwxWkiRJHTFYSZIkdcRgJUmS1BGDlSRJUkcMVpIkSR0xWEmSJHXEYCVJktQRg5UkSVJHDFaSJEkdMVhJkiR1xGAlSZLUEYOVJElSRwxWkiRJHTFYSZIkdcRgJUmS1BGDlSRJUkcMVpIkSR0xWEmSJHXEYCVJktQRg5UkSVJHDFaSJEkdMVhJkiR1xGAlSZLUEYOVJElSR/oOVkkOSvKFJJ9o88cnuS/JxiQfTXJwW35Im9/Y1q+cp9olSZKGyr4csboC2DBj/l3AdVV1AvAUcFlbfhnwVFt+XWsnSZI09voKVkmWA+cAH2jzAc4EbmtNbgLOb9PntXna+lWtvSRJ0lhLVc3dKLkN+FfAc4DfBd4A3NuOSpFkBfAfqurFSb4EnFVVm9u6vwFeXlVP7Paaq4HVABMTE798yy239FXwtie38/j39rzu5GVH9vUaC2XHjh0cccQRgy6jL9Y6PxZbrWeccca6qprsqCRJGjlL5mqQ5FxgW1WtSzLV1RtX1VpgLcDk5GRNTfX30u+7+XauXb/nsjdd3N9rLJTp6Wn67degWev8sFZJWlzmDFbA6cCrk5wNHAo8F7geWJpkSVXtBJYDW1r7LcAKYHOSJcCRwDc7r1ySJGnIzHmNVVVdXVXLq2olcBFwd1VdDNwDXNCaXQLc3qbvaPO09XdXP+cbJUmSRtyB3MfqKuCtSTYCxwA3tOU3AMe05W8F1hxYiZIkSaOhn1OBP1FV08B0m34YOHUPbb4PvKaD2iRJkkaKd16XJEnqiMFKkiSpIwYrSZKkjhisJEmSOmKwkiRJ6ojBSpIkqSMGK0mSpI4YrCRJkjpisJIkSeqIwUqSJKkjBitJkqSOGKwkSZI6YrCSJEnqiMFKkiSpIwYrSZKkjhisJEmSOmKwkiRJ6ojBSpIkqSMGK0mSpI4sGXQBXVq55pOzrt90zTkLVIkkSVqMPGIlSZLUEYOVJElSRwxWkiRJHTFYSZIkdcRgJUmS1BGDlSRJUkcMVpIkSR0xWEmSJHXEYCVJktQRg5UkSVJH5gxWSVYkuSfJQ0keTHJFW350kjuTfLV9P6otT5L3JtmY5ItJTpnvTkiSJA2Dfo5Y7QSurKqTgNOAy5OcBKwB7qqqE4G72jzAq4AT29dq4P2dVy1JkjSE5gxWVbW1qj7fpr8NbACWAecBN7VmNwHnt+nzgA9Xz73A0iTHdV24JEnSsElV9d84WQl8Fngx8PWqWtqWB3iqqpYm+QRwTVV9rq27C7iqqu7f7bVW0zuixcTExC/fcsstfdWw7cntPP69vkt+hpOXHbl/G+6nHTt2cMQRRyzoe+4va50fi63WM844Y11VTXZUkiSNnCX9NkxyBPAx4C1V9a1eluqpqkrSf0LrbbMWWAswOTlZU1NTfW33vptv59r1fZf9DJsu7u89ujI9PU2//Ro0a50f1ipJi0tfnwpM8ix6oermqvp4W/z4rlN87fu2tnwLsGLG5svbMkmSpLHWz6cCA9wAbKiqd89YdQdwSZu+BLh9xvLXt08HngZsr6qtHdYsSZI0lPo5p3Y68DpgfZIH2rK3A9cAtya5DHgEuLCt+xRwNrAR+C5waZcFS5IkDas5g1W7CD17Wb1qD+0LuPwA65IkSRo53nldkiSpIwYrSZKkjhisJEmSOmKwkiRJ6ojBSpIkqSMGK0mSpI4YrCRJkjpisJIkSeqIwUqSJKkjBitJkqSOGKwkSZI6YrCSJEnqiMFKkiSpIwYrSZKkjhisJEmSOmKwkiRJ6ojBSpIkqSMGK0mSpI4YrCRJkjpisJIkSerIkkEXMCxWrvnkfm+76ZpzOqxEkiSNqkUVrA4kPEmSJM3FU4GSJEkdMVhJkiR1xGAlSZLUEYOVJElSRwxWkiRJHTFYSZIkdcRgJUmS1BGDlSRJUkcMVpIkSR2ZlzuvJzkLuB44CPhAVV0zH+8zLPZ2
R/crT97JG+bpbu/z9Ridue5Ov7/vO1+vK0nSMOn8iFWSg4B/DbwKOAl4bZKTun4fSZKkYTMfR6xOBTZW1cMASW4BzgMemof3WrQGdQTI5y2On137dG9HWD2aKEn9S1V1+4LJBcBZVfXGNv864OVV9abd2q0GVrfZFwJf6fMtjgWe6Kjc+Wat88Na50cXtb6gqp7XRTGSNIrm5RqrflTVWmDtvm6X5P6qmpyHkjpnrfPDWufHKNUqScNqPj4VuAVYMWN+eVsmSZI01uYjWP01cGKS45McDFwE3DEP7yNJkjRUOj8VWFU7k7wJ+At6t1v4YFU92OFb7PPpwwGy1vlhrfNjlGqVpKHU+cXrkiRJi5V3XpckSeqIwUqSJKkjIxWskpyV5CtJNiZZMwT1fDDJtiRfmrHs6CR3Jvlq+35UW54k7221fzHJKQtc64ok9yR5KMmDSa4Y1nqTHJrkr5L811br77flxye5r9X00fbhCJIc0uY3tvUrF6rW9v4HJflCkk8Mc52thk1J1id5IMn9bdnQ/QxI0qgamWA1pI/KuRE4a7dla4C7qupE4K42D726T2xfq4H3L1CNu+wErqyqk4DTgMvbv98w1vsD4MyqegnwUuCsJKcB7wKuq6oTgKeAy1r7y4Cn2vLrWruFdAWwYcb8sNa5yxlV9dIZ96waxp8BSRpJIxOsmPGonKr6IbDrUTkDU1WfBZ7cbfF5wE1t+ibg/BnLP1w99wJLkxy3IIUCVbW1qj7fpr9NLwgsG8Z623vuaLPPal8FnAnctpdad/XhNmBVkixErUmWA+cAH2jzGcY65zB0PwOSNKpGKVgtAx6dMb+5LRs2E1W1tU0/Bky06aGpv52CehlwH0Nabzu99gCwDbgT+Bvg6arauYd6flJrW78dOGaBSn0P8Dbgx23+mCGtc5cCPpNkXXusFAzpz4AkjaKBPdJmMaiqSjJU97NIcgTwMeAtVfWtmQdMhqneqvo74KVJlgJ/DrxosBX9rCTnAtuqal2SqQGX069fqaotSX4OuDPJl2euHKafAUkaRaN0xGpUHpXz+K7TJe37trZ84PUneRa9UHVzVX28LR7aegGq6mngHuAV9E5F7fpjYGY9P6m1rT8S+OYClHc68Ookm+idmj4TuH4I6/yJqtrSvm+jF1hPZch/BiRplIxSsBqVR+XcAVzSpi8Bbp+x/PXtk1anAdtnnH6Zd+1anhuADVX17mGuN8nz2pEqkhwG/CN614TdA1ywl1p39eEC4O5agDvfVtXVVbW8qlbS+3m8u6ouHrY6d0lyeJLn7JoGfh34EkP4MyBJo2qk7rye5Gx617TselTOOwdcz0eAKeBY4HHgHcC/B24Ffh54BLiwqp5sweaP6H2K8LvApVV1/wLW+ivAXwLr+en1QG+nd53VUNWb5B/Qu4j6IHrh/9aq+oMkv0DvyNDRwBeA36yqHyQ5FPh39K4bexK4qKoeXohaZ9Q8BfxuVZ07rHW2uv68zS4B/rSq3pnkGIbsZ0CSRtVIBStJkqRhNkqnAiVJkoaawUqSJKkjBitJkqSOGKwkSZI6YrCSJEnqiMFKkiSpIwYrSZKkjvz/l9KpqBBwjQYAAAAASUVORK5CYII=\n",
      "text/plain": [
       "<Figure size 720x720 with 6 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "data[vars_num].hist(bins=30, figsize=(10,10))\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Separate data into train and test"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "((1047, 9), (262, 9))"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    data.drop('survived', axis=1),  # predictors\n",
    "    data['survived'],  # target\n",
    "    test_size=0.2,  # fraction of observations in the test set\n",
    "    random_state=0)  # seed to ensure reproducibility\n",
    "\n",
    "X_train.shape, X_test.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Feature Engineering\n",
    "\n",
    "### Extract only the letter (and drop the number) from the variable Cabin"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([nan, 'E', 'F', 'A', 'C', 'D', 'B', 'T', 'G'], dtype=object)"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_train['cabin'] = X_train['cabin'].str[0] # captures the first letter\n",
    "X_test['cabin'] = X_test['cabin'].str[0] # captures the first letter\n",
    "\n",
    "X_train['cabin'].unique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Fill in Missing data in numerical variables:\n",
    "\n",
    "- Add a binary missing indicator\n",
    "- Fill NA in the original variable with the median (learned from the train set)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "age     0\n",
       "fare    0\n",
       "dtype: int64"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "for var in ['age', 'fare']:\n",
    "\n",
    "    # add missing indicator\n",
    "    X_train[var+'_NA'] = np.where(X_train[var].isnull(), 1, 0)\n",
    "    X_test[var+'_NA'] = np.where(X_test[var].isnull(), 1, 0)\n",
    "\n",
    "    # replace NaN with the median learned from the train set\n",
    "    median_val = X_train[var].median()\n",
    "\n",
    "    X_train[var] = X_train[var].fillna(median_val)\n",
    "    X_test[var] = X_test[var].fillna(median_val)\n",
    "\n",
    "X_train[['age', 'fare']].isnull().sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Replace Missing data in categorical variables with the string **Missing**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train[vars_cat] = X_train[vars_cat].fillna('Missing')\n",
    "X_test[vars_cat] = X_test[vars_cat].fillna('Missing')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pclass      0\n",
       "sex         0\n",
       "age         0\n",
       "sibsp       0\n",
       "parch       0\n",
       "fare        0\n",
       "cabin       0\n",
       "embarked    0\n",
       "title       0\n",
       "age_NA      0\n",
       "fare_NA     0\n",
       "dtype: int64"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_train.isnull().sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pclass      0\n",
       "sex         0\n",
       "age         0\n",
       "sibsp       0\n",
       "parch       0\n",
       "fare        0\n",
       "cabin       0\n",
       "embarked    0\n",
       "title       0\n",
       "age_NA      0\n",
       "fare_NA     0\n",
       "dtype: int64"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_test.isnull().sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Remove rare labels in categorical variables\n",
    "\n",
    "- Replace labels present in less than 5% of the passengers with the string **Rare**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "def find_frequent_labels(df, var, rare_perc):\n",
    "    \n",
    "    # function finds the labels that are shared by more than\n",
    "    # a certain % of the passengers in the dataset\n",
    "    \n",
    "    df = df.copy()\n",
    "    \n",
    "    tmp = df.groupby(var)[var].count() / len(df)\n",
    "    \n",
    "    return tmp[tmp > rare_perc].index\n",
    "\n",
    "\n",
    "for var in vars_cat:\n",
    "    \n",
    "    # find the frequent categories\n",
    "    frequent_ls = find_frequent_labels(X_train, var, 0.05)\n",
    "    \n",
    "    # replace rare categories by the string \"Rare\"\n",
    "    X_train[var] = np.where(X_train[var].isin(\n",
    "        frequent_ls), X_train[var], 'Rare')\n",
    "    \n",
    "    X_test[var] = np.where(X_test[var].isin(\n",
    "        frequent_ls), X_test[var], 'Rare')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "sex         2\n",
       "cabin       3\n",
       "embarked    4\n",
       "title       4\n",
       "dtype: int64"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_train[vars_cat].nunique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "sex         2\n",
       "cabin       3\n",
       "embarked    3\n",
       "title       4\n",
       "dtype: int64"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_test[vars_cat].nunique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Perform one hot encoding of categorical variables into k-1 binary variables\n",
    "\n",
    "- k-1 means that if the variable contains 9 different categories, we create 8 binary variables\n",
    "- Remember to drop the original categorical variable (the one with the strings) after the encoding"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "((1047, 16), (262, 15))"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "for var in vars_cat:\n",
    "    \n",
    "    # to create the binary variables, we use get_dummies from pandas\n",
    "    \n",
    "    X_train = pd.concat([X_train,\n",
    "                         pd.get_dummies(X_train[var], prefix=var, drop_first=True)\n",
    "                         ], axis=1)\n",
    "    \n",
    "    X_test = pd.concat([X_test,\n",
    "                        pd.get_dummies(X_test[var], prefix=var, drop_first=True)\n",
    "                        ], axis=1)\n",
    "    \n",
    "\n",
    "X_train.drop(labels=vars_cat, axis=1, inplace=True)\n",
    "X_test.drop(labels=vars_cat, axis=1, inplace=True)\n",
    "\n",
    "X_train.shape, X_test.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pclass</th>\n",
       "      <th>age</th>\n",
       "      <th>sibsp</th>\n",
       "      <th>parch</th>\n",
       "      <th>fare</th>\n",
       "      <th>age_NA</th>\n",
       "      <th>fare_NA</th>\n",
       "      <th>sex_male</th>\n",
       "      <th>cabin_Missing</th>\n",
       "      <th>cabin_Rare</th>\n",
       "      <th>embarked_Q</th>\n",
       "      <th>embarked_Rare</th>\n",
       "      <th>embarked_S</th>\n",
       "      <th>title_Mr</th>\n",
       "      <th>title_Mrs</th>\n",
       "      <th>title_Rare</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1118</th>\n",
       "      <td>3</td>\n",
       "      <td>25.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>7.9250</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>44</th>\n",
       "      <td>1</td>\n",
       "      <td>41.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>134.5000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1072</th>\n",
       "      <td>3</td>\n",
       "      <td>28.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>7.7333</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1130</th>\n",
       "      <td>3</td>\n",
       "      <td>18.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>7.7750</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>574</th>\n",
       "      <td>2</td>\n",
       "      <td>29.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>21.0000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      pclass   age  sibsp  parch      fare  age_NA  fare_NA  sex_male  \\\n",
       "1118       3  25.0      0      0    7.9250       0        0         1   \n",
       "44         1  41.0      0      0  134.5000       0        0         0   \n",
       "1072       3  28.0      0      0    7.7333       1        0         1   \n",
       "1130       3  18.0      0      0    7.7750       0        0         0   \n",
       "574        2  29.0      1      0   21.0000       0        0         1   \n",
       "\n",
       "      cabin_Missing  cabin_Rare  embarked_Q  embarked_Rare  embarked_S  \\\n",
       "1118              1           0           0              0           1   \n",
       "44                0           1           0              0           0   \n",
       "1072              1           0           1              0           0   \n",
       "1130              1           0           0              0           1   \n",
       "574               1           0           0              0           1   \n",
       "\n",
       "      title_Mr  title_Mrs  title_Rare  \n",
       "1118         1          0           0  \n",
       "44           0          0           0  \n",
       "1072         1          0           0  \n",
       "1130         0          0           0  \n",
       "574          1          0           0  "
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Note that the test set has one fewer column than the train set:\n",
    "# the category Rare of embarked is not present in the test set.\n",
    "\n",
    "# We need to add that column manually to the test set\n",
    "\n",
    "X_train.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pclass</th>\n",
       "      <th>age</th>\n",
       "      <th>sibsp</th>\n",
       "      <th>parch</th>\n",
       "      <th>fare</th>\n",
       "      <th>age_NA</th>\n",
       "      <th>fare_NA</th>\n",
       "      <th>sex_male</th>\n",
       "      <th>cabin_Missing</th>\n",
       "      <th>cabin_Rare</th>\n",
       "      <th>embarked_Q</th>\n",
       "      <th>embarked_S</th>\n",
       "      <th>title_Mr</th>\n",
       "      <th>title_Mrs</th>\n",
       "      <th>title_Rare</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1139</th>\n",
       "      <td>3</td>\n",
       "      <td>38.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>7.8958</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>533</th>\n",
       "      <td>2</td>\n",
       "      <td>21.0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>21.0000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>459</th>\n",
       "      <td>2</td>\n",
       "      <td>42.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>27.0000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1150</th>\n",
       "      <td>3</td>\n",
       "      <td>28.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>14.5000</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>393</th>\n",
       "      <td>2</td>\n",
       "      <td>25.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>31.5000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      pclass   age  sibsp  parch     fare  age_NA  fare_NA  sex_male  \\\n",
       "1139       3  38.0      0      0   7.8958       0        0         1   \n",
       "533        2  21.0      0      1  21.0000       0        0         0   \n",
       "459        2  42.0      1      0  27.0000       0        0         1   \n",
       "1150       3  28.0      0      0  14.5000       1        0         1   \n",
       "393        2  25.0      0      0  31.5000       0        0         1   \n",
       "\n",
       "      cabin_Missing  cabin_Rare  embarked_Q  embarked_S  title_Mr  title_Mrs  \\\n",
       "1139              1           0           0           1         1          0   \n",
       "533               1           0           0           1         0          0   \n",
       "459               1           0           0           1         1          0   \n",
       "1150              1           0           0           1         1          0   \n",
       "393               1           0           0           1         1          0   \n",
       "\n",
       "      title_Rare  \n",
       "1139           0  \n",
       "533            0  \n",
       "459            0  \n",
       "1150           0  \n",
       "393            0  "
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_test.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "# we add 0 as values for all the observations, as Rare\n",
    "# was not present in the test set\n",
    "\n",
    "X_test['embarked_Rare'] = 0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['pclass',\n",
       " 'age',\n",
       " 'sibsp',\n",
       " 'parch',\n",
       " 'fare',\n",
       " 'age_NA',\n",
       " 'fare_NA',\n",
       " 'sex_male',\n",
       " 'cabin_Missing',\n",
       " 'cabin_Rare',\n",
       " 'embarked_Q',\n",
       " 'embarked_Rare',\n",
       " 'embarked_S',\n",
       " 'title_Mr',\n",
       " 'title_Mrs',\n",
       " 'title_Rare']"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Note that embarked_Rare is now the last column of the test set,\n",
    "# so to pass the variables in the same order as in the train set,\n",
    "# we store the train set column order in a list:\n",
    "\n",
    "variables = list(X_train.columns)\n",
    "\n",
    "variables"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Scale the variables\n",
    "\n",
    "- Use the standard scaler from Scikit-learn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "# create scaler\n",
    "scaler = StandardScaler()\n",
    "\n",
    "# fit the scaler to the train set\n",
    "scaler.fit(X_train[variables])\n",
    "\n",
    "# transform the train and test set\n",
    "X_train = scaler.transform(X_train[variables])\n",
    "\n",
    "X_test = scaler.transform(X_test[variables])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train the Logistic Regression model\n",
    "\n",
    "- Set the regularization parameter `C` (the inverse of the regularization strength) to 0.0005\n",
    "- Set the seed to 0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "LogisticRegression(C=0.0005, random_state=0)"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# set up the model\n",
    "# remember to set the random_state / seed\n",
    "\n",
    "model = LogisticRegression(C=0.0005, random_state=0)\n",
    "\n",
    "# train the model\n",
    "model.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Make predictions and evaluate model performance\n",
    "\n",
    "Determine:\n",
    "- roc-auc\n",
    "- accuracy\n",
    "\n",
    "**Important: to determine the accuracy, you need the predicted class (0 or 1, i.e. did not survive or survived), whereas to determine the roc-auc, you need the predicted probability of survival.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "train roc-auc: 0.8431723338485316\n",
      "train accuracy: 0.7125119388729704\n",
      "\n",
      "test roc-auc: 0.8354012345679012\n",
      "test accuracy: 0.7022900763358778\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# make predictions for train set\n",
    "class_ = model.predict(X_train)\n",
    "pred = model.predict_proba(X_train)[:,1]\n",
    "\n",
    "# determine roc-auc and accuracy\n",
    "print('train roc-auc: {}'.format(roc_auc_score(y_train, pred)))\n",
    "print('train accuracy: {}'.format(accuracy_score(y_train, class_)))\n",
    "print()\n",
    "\n",
    "# make predictions for test set\n",
    "class_ = model.predict(X_test)\n",
    "pred = model.predict_proba(X_test)[:,1]\n",
    "\n",
    "# determine roc-auc and accuracy\n",
    "print('test roc-auc: {}'.format(roc_auc_score(y_test, pred)))\n",
    "print('test accuracy: {}'.format(accuracy_score(y_test, class_)))\n",
    "print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That's it! Well done.\n",
    "\n",
    "**Keep this code safe: we will use this notebook later on to build production code in our next assignment!**"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "feml",
   "language": "python",
   "name": "feml"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.2"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
